
[SPARK-23280][SQL] add map type support to ColumnVector #20450

Closed
wants to merge 4 commits into apache:master from cloud-fan:map

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jan 31, 2018

What changes were proposed in this pull request?

Fill in the last missing piece of ColumnVector: map type support.

The idea is similar to the array type support. A map is basically two arrays: keys and values. We ask the implementations to provide a key array, a value array, and an offset and length that specify the range of this map in the key/value arrays.

In WritableColumnVector, we put the key array in the first child vector, the value array in the second child vector, and the offsets and lengths in the current vector, which is very similar to how the array type is implemented.
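For illustration only, here is a minimal sketch of that layout. It is not code from this PR; it assumes the array-style WritableColumnVector API (getChild, putArray) carries over to maps exactly as described above, and that OnHeapColumnVector creates the two child vectors for a MapType automatically.

```java
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnarMap;

public class MapVectorSketch {
  public static void main(String[] args) {
    // One column of map<int, int> with room for 2 rows.
    WritableColumnVector maps = new OnHeapColumnVector(
        2, DataTypes.createMapType(DataTypes.IntegerType, DataTypes.IntegerType));
    WritableColumnVector keys = maps.getChild(0);    // first child: keys of all maps
    WritableColumnVector values = maps.getChild(1);  // second child: values of all maps

    // Row 0 holds [1 -> 10]; row 1 holds [2 -> 20, 3 -> 30].
    keys.putInt(0, 1);  values.putInt(0, 10);
    keys.putInt(1, 2);  values.putInt(1, 20);
    keys.putInt(2, 3);  values.putInt(2, 30);

    // The parent vector only records each row's (offset, length) into the children.
    maps.putArray(0, 0, 1);  // row 0: entries [0, 1)
    maps.putArray(1, 1, 2);  // row 1: entries [1, 3)

    // Reading back: getMap returns a ColumnarMap backed by the two child vectors.
    ColumnarMap row1 = maps.getMap(1);
    System.out.println(row1.numElements());           // 2
    System.out.println(row1.keyArray().getInt(0));    // 2
    System.out.println(row1.valueArray().getInt(0));  // 20
  }
}
```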

How was this patch tested?

a new test

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86866 has finished for PR 20450 at commit 8e66fb5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnarMap extends MapData

@ueshin
Member

ueshin commented Jan 31, 2018

Should we also enable getMap() on ColumnarArray/ColumnarRow?

Member

@ueshin ueshin left a comment

LGTM except for one comment.

column.putArray(0, 0, 1)
column.putArray(1, 1, 2)
column.putNull(2)
column.putArray(3, 2, 0)
Member

column.putArray(3, 3, 0)?
Seems like line 670 is the same mistake?

Contributor Author

@cloud-fan cloud-fan Jan 31, 2018

The key array is: 0, 1, 2, 3, 4, 5
The value array is: 0, 2, 4, 6, 8, 10

Note that the map [1->2, 2->4] contributes 2 keys and 2 values.

Member

Yes, so the offset of the next array should be 3?

Contributor Author

Ah, I see, you are right!
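To spell out the arithmetic settled in this thread, here is a hedged sketch, written in Java for consistency with the other sketches on this page (the actual test is Scala, and the concrete rows below are reconstructed from the comments above rather than copied from the PR; it reuses the imports from the sketch near the top of this page). The first three rows consume 1 + 2 + 0 = 3 key/value entries, so the next map's offset must be 3.

```java
// Reconstructed layout for the maps discussed in this thread:
//   keys child:   0, 1, 2, 3, 4, 5
//   values child: 0, 2, 4, 6, 8, 10
WritableColumnVector column = new OnHeapColumnVector(
    5, DataTypes.createMapType(DataTypes.IntegerType, DataTypes.IntegerType));
WritableColumnVector keys = column.getChild(0);
WritableColumnVector values = column.getChild(1);
for (int i = 0; i < 6; i++) { keys.putInt(i, i); values.putInt(i, i * 2); }

// Per-row (offset, length) into the shared key/value children:
column.putArray(0, 0, 1);  // [0->0]                -> entries [0, 1)
column.putArray(1, 1, 2);  // [1->2, 2->4]          -> entries [1, 3)
column.putNull(2);         // null map
column.putArray(3, 3, 0);  // []                    -> offset 3 (= 1 + 2 + 0), length 0
column.putArray(4, 3, 3);  // [3->6, 4->8, 5->10]   -> entries [3, 6)
```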

}
}

// Populate it with maps [0->0], [1->2, 2->4], [], [3->6, 4->8, 5->10]
Member

// Populate it with maps [0->0], [1->2, 2->4], null, [], [3->6, 4->8, 5->10]?

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86868 has finished for PR 20450 at commit 2603d02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* To support map type, implementations must construct an {@link ColumnarMap} and return it in
* this method. {@link ColumnarMap} requires a {@link ColumnVector} that stores the data of all
* the keys of all the maps in this vector, and another {@link ColumnVector} that stores the data
* of all the values of all the maps in this vector, and an offset and length which specifies the
Member

nit: an offset and length -> a pair of offset and length? Or specifies -> specify?
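As a hedged illustration of the contract this javadoc describes, an implementation's getMap could look roughly like the sketch below. The field names keyData, valueData, offsets, and lengths are invented for illustration and are not from any real vector implementation.

```java
// Hypothetical implementation sketch of the getMap contract described above.
@Override
public ColumnarMap getMap(int rowId) {
  // keyData/valueData hold the keys/values of every map in this vector;
  // offsets[rowId] and lengths[rowId] delimit this row's slice of them.
  return new ColumnarMap(keyData, valueData, offsets[rowId], lengths[rowId]);
}
```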

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86872 has finished for PR 20450 at commit 48c275f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@SparkQA

SparkQA commented Jan 31, 2018

Test build #86883 has finished for PR 20450 at commit 182b404.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

retest this please

@@ -530,7 +530,7 @@ public int putByteArray(int rowId, byte[] value, int offset, int length) {
@Override
protected void reserveInternal(int newCapacity) {
int oldCapacity = (nulls == 0L) ? 0 : capacity;
if (isArray()) {
if (isArray() || type instanceof MapType) {
Contributor

nit: we may also have a method isMap().

Contributor Author

This might be overkill: isArray needs to take care of many types, but for isMap we would only accept one type: MapType.
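For context, a rough sketch of the asymmetry being described (this mirrors, but may not exactly match, the real WritableColumnVector code; the helper names here are made up, and the types come from org.apache.spark.sql.types):

```java
// Types whose values are stored as (offset, length) slices into a child vector
// all go through the array-style code path:
private static boolean isArrayLike(DataType type) {
  return type instanceof ArrayType || type instanceof BinaryType ||
      type instanceof StringType || DecimalType.isByteArrayDecimalType(type);
}

// A dedicated isMap() helper would boil down to a single check:
private static boolean isMap(DataType type) {
  return type instanceof MapType;
}
```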

*
* To support map type, implementations must construct an {@link ColumnarMap} and return it in
* this method. {@link ColumnarMap} requires a {@link ColumnVector} that stores the data of all
* the keys of all the maps in this vector, and another {@link ColumnVector} that stores the data
Contributor

nit: maps -> map entries ?

Contributor Author

"keys of map entries" sounds weird...

}

@Override
public int numElements() { return length; }
Contributor

numElements or length?

Contributor Author

@cloud-fan cloud-fan Feb 1, 2018

This is an API from the parent class; we can't change it.

@jiangxb1987
Contributor

LGTM, only some nits and naming issues.

Member

@viirya viirya left a comment

LGTM

* key array with a index and a value from the value array with the same index contribute to
* an entry of this map type value.
*
* To support map type, implementations must construct an {@link ColumnarMap} and return it in
Member

construct an -> construct a.

Contributor Author

This is very minor and may not be worth waiting for another QA round. Maybe we can fix it in your "return null" PR?

Member

Sure. LGTM.

@jiangxb1987
Contributor

LGTM

@SparkQA

SparkQA commented Feb 1, 2018

Test build #86896 has finished for PR 20450 at commit 182b404.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

Thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

Fill in the last missing piece of `ColumnVector`: map type support.

The idea is similar to the array type support. A map is basically two arrays: keys and values. We ask the implementations to provide a key array, a value array, and an offset and length that specify the range of this map in the key/value arrays.

In `WritableColumnVector`, we put the key array in the first child vector, the value array in the second child vector, and the offsets and lengths in the current vector, which is very similar to how the array type is implemented.

## How was this patch tested?

a new test

Author: Wenchen Fan <[email protected]>

Closes #20450 from cloud-fan/map.

(cherry picked from commit 52e00f7)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 52e00f7 Feb 1, 2018
@viirya
Member

viirya commented Feb 1, 2018

I found that we don't enable the getMap API in MutableColumnarRow in this change; should we do it? If so, I can make a small follow-up PR for it.
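For reference, a hedged sketch of what enabling it could look like (the real change landed in the follow-up PR referenced further down; the columns and rowId field names are assumed here, not confirmed by this page):

```java
// Hypothetical: MutableColumnarRow delegating map access to the underlying vector.
@Override
public ColumnarMap getMap(int ordinal) {
  return columns[ordinal].getMap(rowId);
}
```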

@ueshin
Member

ueshin commented Feb 1, 2018

@viirya Thanks, but I'm already working on it. I'll do it soon.

@viirya
Member

viirya commented Feb 1, 2018

@ueshin Ok. No problem. :)

asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

This is a follow-up of #20450, which broke the lint-java checks.
This PR fixes the lint-java issues.

```
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarArray.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
[ERROR] src/main/java/org/apache/spark/sql/vectorized/ColumnarRow.java:[22,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.util.MapData.
```

## How was this patch tested?

Checked manually in my local environment.

Author: Takuya UESHIN <[email protected]>

Closes #20468 from ueshin/issues/SPARK-23280/fup1.

(cherry picked from commit 8bb70b0)
Signed-off-by: Takuya UESHIN <[email protected]>
asfgit pushed a commit that referenced this pull request Feb 1, 2018
## What changes were proposed in this pull request?

This is a follow-up PR of #20450.
We should've enabled `MutableColumnarRow.getMap()` as well.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <[email protected]>

Closes #20471 from ueshin/issues/SPARK-23280/fup2.

(cherry picked from commit 89e8d55)
Signed-off-by: Takuya UESHIN <[email protected]>